I have chosen to KEEP my Midterm as part of my final grade. The analysis below describes how I came to this conclusion. Keep in mind that my analysis builds a model, used in the Conclusion tab, to predict my final test score; knowing that predicted score allowed me to come to a logical conclusion. It is also important to note that my analysis uses the dataset provided by Bro. Saunders containing previous students' scores.
My first step is to look at the data to see what I'm working with and which variables are included. The datatable below shows a snippet of the data.
datatable(midterm, options=list(lengthMenu = c(5,10,25)))
From this snippet, a couple of things stand out. The data looks good for the most part. There are 9 variables, listed below.
colnames(midterm) %>%
pander()
Gender, Midterm, FinalExam, AssessmentQuizCompletion, AssessmentQuizActual, AnalysesTheory, PeerReviews, ClassActivities and SkillQuizzes
Something that could cause issues later on is the missing values in the dataset. Rather than filter these rows out, let's go ahead and fill them in with 0s, since the rest of the information in those rows may be useful in the analysis. If anything looks off, we'll filter them out later.
An updated dataset is shown below:
midterm <- midterm %>%
replace(is.na(.), 0)
datatable(midterm, options=list(lengthMenu = c(5,10,25)))
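As a quick sanity check of what `replace(is.na(.), 0)` does inside the pipe, here is the same step on a toy data frame (hypothetical values, not the class data):

```r
library(dplyr)

# toy scores with missing values (made-up numbers)
toy <- data.frame(Midterm   = c(80, NA, 92),
                  FinalExam = c(75, 88, NA))

# replace every NA in the data frame with 0, keeping the rest of each row intact
toy_filled <- toy %>% replace(is.na(.), 0)

toy_filled$Midterm    # 80  0 92
toy_filled$FinalExam  # 75 88  0
```

The logical matrix `is.na(.)` indexes every missing cell at once, so each NA is overwritten without touching any observed values.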
With the dataset ready to go, let's take a quick look at the initial pairs plot to see how the variables, including Midterm, look in relation to the final test score. The first pairs plot is shown below:
pairs(midterm, panel = panel.smooth)
Several variables seem interesting; in particular Gender, Midterm, AnalysesTheory, AssessmentQuizActual and SkillQuizzes. The two that stand out the most are Midterm and AnalysesTheory. I'm going to start with the Midterm variable, and since the data seems to have a curve to it, we'll include a squared term right away and see if it is significant.
Running the regression we get the following results:
lm1 <- lm(FinalExam+1~Midterm + I(Midterm^2), data = midterm)
summary(lm1)
##
## Call:
## lm(formula = FinalExam + 1 ~ Midterm + I(Midterm^2), data = midterm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.141 -7.661 4.859 11.263 31.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.692345 8.859572 4.254 8.76e-05 ***
## Midterm -0.059799 0.399867 -0.150 0.882
## I(Midterm^2) 0.006058 0.004306 1.407 0.165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.85 on 52 degrees of freedom
## Multiple R-squared: 0.2841, Adjusted R-squared: 0.2566
## F-statistic: 10.32 on 2 and 52 DF, p-value: 0.0001682
The summary shows that neither Midterm term is significant. The fit of the model is also poor, with an R-squared of only 0.2841.
However, despite this, I'm curious. Let's look at the residuals on the pairs plot to see if anything stands out.
pairs(cbind(R=lm1$res, Fit = lm1$fit, midterm), panel = panel.smooth)
AnalysesTheory and AssessmentQuizActual still stand out to me. Starting with AnalysesTheory: the relationship seems to be a simple line, so we'll add that into our model, along with an interaction with Midterm.
We get the following results.
lm2 <- lm(FinalExam+1~Midterm + I(Midterm^2) + AnalysesTheory + Midterm:AnalysesTheory, data = midterm)
summary(lm2)
##
## Call:
## lm(formula = FinalExam + 1 ~ Midterm + I(Midterm^2) + AnalysesTheory +
## Midterm:AnalysesTheory, data = midterm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.414 -4.956 4.758 10.371 25.465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.867933 14.205293 -0.483 0.630868
## Midterm 0.044512 0.359632 0.124 0.901994
## I(Midterm^2) 0.009103 0.004018 2.265 0.027849 *
## AnalysesTheory 0.850762 0.228478 3.724 0.000499 ***
## Midterm:AnalysesTheory -0.007951 0.003285 -2.420 0.019179 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.38 on 50 degrees of freedom
## Multiple R-squared: 0.4727, Adjusted R-squared: 0.4306
## F-statistic: 11.21 on 4 and 50 DF, p-value: 1.44e-06
Okay, so it looks like our Midterm squared term is now significant, with the intercept and the linear Midterm term being the only non-significant terms. Let's take a look at the pairs plot again and see if there is anything else.
pairs(cbind(R=lm2$res, Fit = lm2$fit, midterm), panel = panel.smooth)
From this pairs plot, I'm beginning to think the interaction may belong with the squared Midterm term rather than the linear one. I'm just going to swap it in and see.
lm3 <- lm(FinalExam+1~Midterm + I(Midterm^2) * AnalysesTheory, data = midterm)
summary(lm3)
##
## Call:
## lm(formula = FinalExam + 1 ~ Midterm + I(Midterm^2) * AnalysesTheory,
## data = midterm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.669 -4.310 4.591 11.066 25.383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.871e+00 1.376e+01 -0.427 0.671519
## Midterm -5.873e-01 3.933e-01 -1.493 0.141665
## I(Midterm^2) 1.699e-02 5.919e-03 2.870 0.006000 **
## AnalysesTheory 8.335e-01 2.191e-01 3.805 0.000388 ***
## I(Midterm^2):AnalysesTheory -9.787e-05 3.974e-05 -2.462 0.017287 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.35 on 50 degrees of freedom
## Multiple R-squared: 0.4747, Adjusted R-squared: 0.4326
## F-statistic: 11.29 on 4 and 50 DF, p-value: 1.319e-06
Okay, so as you can see, everything except the linear Midterm term is significant. Looking at the new pairs plot:
pairs(cbind(R=lm3$res, Fit = lm3$fit, midterm), panel = panel.smooth)
And AnalysesTheory still interests me. Let's add a cubed term to the model and see what happens.
lm4 <- lm(FinalExam+1~Midterm + I(Midterm^2) * AnalysesTheory + I(AnalysesTheory^3), data = midterm)
summary(lm4)
##
## Call:
## lm(formula = FinalExam + 1 ~ Midterm + I(Midterm^2) * AnalysesTheory +
## I(AnalysesTheory^3), data = midterm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.697 -4.429 2.118 10.725 28.774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.050e+00 1.408e+01 0.075 0.94081
## Midterm -4.753e-01 3.910e-01 -1.215 0.23004
## I(Midterm^2) 1.685e-02 5.804e-03 2.903 0.00553 **
## AnalysesTheory 4.885e-01 2.930e-01 1.667 0.10181
## I(AnalysesTheory^3) 3.606e-05 2.082e-05 1.732 0.08958 .
## I(Midterm^2):AnalysesTheory -1.151e-04 4.021e-05 -2.861 0.00620 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.01 on 49 degrees of freedom
## Multiple R-squared: 0.505, Adjusted R-squared: 0.4545
## F-statistic: 9.997 on 5 and 49 DF, p-value: 1.232e-06
So the cubed term is not quite significant at the 0.05 level, and the linear AnalysesTheory term has lost its significance as well. The new pairs plot:
pairs(cbind(R=lm4$res, Fit = lm4$fit, midterm), panel = panel.smooth)
Let's take out one term at a time here. I'm going to go ahead and throw out the linear Midterm term first.
lm5 <- lm(FinalExam+1~I(Midterm^2) + AnalysesTheory + I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory, data = midterm)
summary(lm5)
##
## Call:
## lm(formula = FinalExam + 1 ~ I(Midterm^2) + AnalysesTheory +
## I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory, data = midterm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.279 -6.323 0.995 10.484 30.319
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.655e+00 1.408e+01 0.189 0.851219
## I(Midterm^2) 1.062e-02 2.744e-03 3.871 0.000315 ***
## AnalysesTheory 3.373e-01 2.665e-01 1.266 0.211489
## I(AnalysesTheory^3) 4.024e-05 2.063e-05 1.951 0.056717 .
## I(Midterm^2):AnalysesTheory -9.578e-05 3.714e-05 -2.579 0.012889 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.09 on 50 degrees of freedom
## Multiple R-squared: 0.4901, Adjusted R-squared: 0.4493
## F-statistic: 12.01 on 4 and 50 DF, p-value: 6.462e-07
Let's take out the linear AnalysesTheory term now.
lm6 <- lm(FinalExam+1~I(Midterm^2) + I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory, data = midterm)
summary(lm6)
##
## Call:
## lm(formula = FinalExam + 1 ~ I(Midterm^2) + I(AnalysesTheory^3) +
## I(Midterm^2):AnalysesTheory, data = midterm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.243 -6.083 0.539 11.074 31.879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.718e+01 8.208e+00 2.093 0.041369 *
## I(Midterm^2) 9.571e-03 2.631e-03 3.639 0.000639 ***
## I(AnalysesTheory^3) 5.807e-05 1.516e-05 3.831 0.000352 ***
## I(Midterm^2):AnalysesTheory -7.926e-05 3.497e-05 -2.266 0.027693 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.19 on 51 degrees of freedom
## Multiple R-squared: 0.4737, Adjusted R-squared: 0.4428
## F-statistic: 15.3 on 3 and 51 DF, p-value: 3.16e-07
Okay, now all of our terms are significant, but let's take a look at the pairs plot.
pairs(cbind(R=lm6$res, Fit = lm6$fit, midterm), panel = panel.smooth)
It appears to me that AssessmentQuizActual has a pattern; let's add that in now.
lm7 <- lm(FinalExam+1~I(Midterm^2) + I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory + AssessmentQuizActual + I(Midterm^2):AssessmentQuizActual + I(AssessmentQuizActual^3), data = midterm)
summary(lm7)
##
## Call:
## lm(formula = FinalExam + 1 ~ I(Midterm^2) + I(AnalysesTheory^3) +
## I(Midterm^2):AnalysesTheory + AssessmentQuizActual + I(Midterm^2):AssessmentQuizActual +
## I(AssessmentQuizActual^3), data = midterm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.908 -7.905 1.535 8.362 34.567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.694e+01 1.141e+01 1.484 0.144302
## I(Midterm^2) 8.583e-03 3.040e-03 2.823 0.006904
## I(AnalysesTheory^3) 6.078e-05 1.521e-05 3.996 0.000221
## AssessmentQuizActual -1.885e-02 3.050e-01 -0.062 0.950971
## I(AssessmentQuizActual^3) 2.178e-05 3.474e-05 0.627 0.533722
## I(Midterm^2):AnalysesTheory -9.754e-05 3.558e-05 -2.742 0.008567
## I(Midterm^2):AssessmentQuizActual 2.952e-05 4.293e-05 0.688 0.494946
##
## (Intercept)
## I(Midterm^2) **
## I(AnalysesTheory^3) ***
## AssessmentQuizActual
## I(AssessmentQuizActual^3)
## I(Midterm^2):AnalysesTheory **
## I(Midterm^2):AssessmentQuizActual
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.57 on 48 degrees of freedom
## Multiple R-squared: 0.5398, Adjusted R-squared: 0.4823
## F-statistic: 9.384 on 6 and 48 DF, p-value: 8.249e-07
Well, it's clear to see that the AssessmentQuizActual terms do not add any significance to the model. Going back to the previous model:
lm8 <- lm(FinalExam+1~I(Midterm^2) + I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory, data = midterm)
summary(lm8)
##
## Call:
## lm(formula = FinalExam + 1 ~ I(Midterm^2) + I(AnalysesTheory^3) +
## I(Midterm^2):AnalysesTheory, data = midterm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.243 -6.083 0.539 11.074 31.879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.718e+01 8.208e+00 2.093 0.041369 *
## I(Midterm^2) 9.571e-03 2.631e-03 3.639 0.000639 ***
## I(AnalysesTheory^3) 5.807e-05 1.516e-05 3.831 0.000352 ***
## I(Midterm^2):AnalysesTheory -7.926e-05 3.497e-05 -2.266 0.027693 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.19 on 51 degrees of freedom
## Multiple R-squared: 0.4737, Adjusted R-squared: 0.4428
## F-statistic: 15.3 on 3 and 51 DF, p-value: 3.16e-07
And after one last look, I don’t see anything else that stands out to me.
pairs(cbind(R=lm6$res, Fit = lm6$fit, midterm), panel = panel.smooth)
I think we have the right components of our model. Let's take a look at the Residuals vs Fitted plot to see what's going on there.
plot(lm6, which =1)
This looks okay, not great. Let's try removing the outlier labeled 42 to see if that helps improve our linearity/constant variance at all. A new lm summary and Residuals vs Fitted plot are shown below.
midterm1 <- midterm %>%
filter(row_number() != 42)
lm8 <- lm(FinalExam+1~I(Midterm^2) + I(AnalysesTheory^3) + I(Midterm^2):AnalysesTheory, data = midterm1)
summary(lm8)
##
## Call:
## lm(formula = FinalExam + 1 ~ I(Midterm^2) + I(AnalysesTheory^3) +
## I(Midterm^2):AnalysesTheory, data = midterm1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.511 -6.543 0.373 9.712 30.472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.717e+01 7.187e+00 2.389 0.02072 *
## I(Midterm^2) 1.124e-02 2.340e-03 4.805 1.45e-05 ***
## I(AnalysesTheory^3) 5.809e-05 1.327e-05 4.376 6.15e-05 ***
## I(Midterm^2):AnalysesTheory -9.789e-05 3.096e-05 -3.162 0.00267 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.05 on 50 degrees of freedom
## Multiple R-squared: 0.5337, Adjusted R-squared: 0.5058
## F-statistic: 19.08 on 3 and 50 DF, p-value: 2.21e-08
b <- lm8$coefficients
plot(lm8, which =1)
Removing the outlier did help our constant variance. The linearity is still not great, but should be alright.
Everything looks pretty good, but just because I'm a curious soul, let's run a Box-Cox check for a possible transformation.
boxCox(lm8)
Well, it looks like we'll stick with our current model, as the Box-Cox suggests no transformation is needed.
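If we wanted the numeric λ rather than eyeballing the plot, the idea can be sketched on simulated data. This uses `MASS::boxcox`, which returns the same profile log-likelihood curve that `car::boxCox` plots; the data here are made up, not the class data:

```r
library(MASS)

set.seed(1)
# toy positive response with a roughly log-normal shape, so lambda should land near 0
x <- rnorm(60)
y <- exp(1 + 0.5 * x + rnorm(60, sd = 0.3))

bc <- boxcox(lm(y ~ x), plotit = FALSE)  # grid of lambda values and their log-likelihoods
lambda_hat <- bc$x[which.max(bc$y)]      # lambda with the highest profile log-likelihood

# lambda_hat near 1 would mean no transformation; near 0 would suggest log(y)
```

In our case the plotted likelihood peaks near λ = 1, which is why no transformation is warranted.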
Therefore our model for predicting a final test score is:
\[ \underbrace{Y_i}_\text{Final Grade} + 1 = 17.1690306 + 0.0112411\underbrace{X_{1i}^2}_\text{Midterm} + 5.8087358\times 10^{-5}\underbrace{X_{2i}^3}_\text{AnalysesTheory} - 9.7888963\times 10^{-5}\underbrace{X_{1i}^2X_{2i}}_\text{Interaction} \]
where \(X_{1i}\) is the Midterm score and \(X_{2i}\) is the AnalysesTheory score (the response is offset by 1, matching the `FinalExam + 1` formula).
As determined in the “Determining the Model” tab, our model is:
\[ \underbrace{Y_i}_\text{Final Grade} + 1 = 17.1690306 + 0.0112411\underbrace{X_{1i}^2}_\text{Midterm} + 5.8087358\times 10^{-5}\underbrace{X_{2i}^3}_\text{AnalysesTheory} - 9.7888963\times 10^{-5}\underbrace{X_{1i}^2X_{2i}}_\text{Interaction} \]
Using this model, I will now input my other scores to determine my predicted final exam score. I scored a 96 on my Midterm and currently have 100% on my Analyses.
p <- predict(lm3, data.frame(Midterm = 96, AnalysesTheory = 100))-1
My predicted Final Exam score is: 86.461289
Before going any further, let's look at the prediction interval to see just how much this score could vary.
p_i <- predict(lm3, data.frame(Midterm = 96, AnalysesTheory = 100), interval = "prediction")-1
p_i %>%
pander()
| fit | lwr | upr |
|---|---|---|
| 86.46 | 48.86 | 124.1 |
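For reference, the bounds in the table are the usual regression prediction interval for a new observation, which accounts for both the uncertainty in the fitted line and the residual scatter:

\[ \hat{y}_h \pm t^{*}_{1-\alpha/2,\; n-p} \, s \sqrt{1 + \mathbf{x}_h^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_h} \]

The leading 1 under the square root is what makes this interval so much wider than a confidence interval for the mean response.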
Okay, the prediction interval shows that my final exam score can still vary quite a bit. Therefore I should not put too much trust in this single model prediction. However, let's still run through the logic.
Assuming that I actually score an 86.461289/100 on my final:
Should I drop my Midterm, I would have an 86.461289 final percentage.
Should I keep my Midterm, I would have (70%)*86.461289 + (30%)*96 = 89.3229 final percentage.
Assuming that I actually score a 48.8585981/100 on my final:
Should I drop my Midterm, I would have a 48.8585981 final percentage.
Should I keep my Midterm, I would have (70%)*48.8585981 + (30%)*96 = 63.00102 final percentage.
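In fact, the two cases above are instances of a simple rule: with a 70/30 weighting, keeping a 96 Midterm beats dropping it whenever my final exam score \(F\) comes in below 96:

\[ 0.7F + 0.3(96) > F \iff 28.8 > 0.3F \iff F < 96. \]

So the decision to keep only backfires if I actually outscore my 96 Midterm on the final.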
Looking at both possibilities, I’d say that it is in my best interest to keep my Midterm.